Talk:The Priestess Warrior and the Mithril Giants/@comment-93.177.176.211-20170705134328/@comment-26851283-20170708063641
@ssvb, ok, perhaps I overreacted slightly. If data collection proceeds without obfuscation and continues until the critical sample size is reached, adjusting the starting criteria should not usually affect the statistical results. The real danger is when the arbitrary exclusion criteria used at the start of the experiment persist, or when they cause premature termination of the experiment. That being said, it looks like we are trying to test for the influence of an undefined variable. The standard "simple" test for this is to run an experiment assuming the variable and compare it to a control where the variable is not assumed. A control is needed so that you can compute statistical significance from data that is relevant to your controlled setting. This may reduce external validity (because it introduces the potential that your control is flawed), but it greatly increases internal validity, which in most cases is preferred. In the case of the simple coin flip example you proposed, answering the question of whether the coin is fair is impossible to do in one experiment (or rather one experiment in one part), and is a classic example of overextending the scope of an investigation. What we can do, however, is start with the hypothesis that the probability of receiving heads or tails is affected by whether or not you first got a head. Notice that this experiment is highly restricted in scope. This is because the restriction allows us a much clearer conclusion from the statistical results we obtain. We support or reject a specific statement, and not an overarching idea. To proceed with the experiment we need to set up an experimental group where it is assumed that getting heads will result in a different set of probabilities. Now one way you might set up the experimental vs control group is to have the experimental group collect only trials that start with heads and the control group collect only trials that start with tails.
This would appear valid since in this experiment there are only two possible starting points. However, this experiment would increase the potential for false positives because it assumes an established dichotomy, and so the experimenter is subtly influenced to gravitate towards results that might demonstrate a dichotomy. A better experiment would have one experimental group that collected only trials starting with heads, a counter-experimental group that collected only trials starting with tails, and a control group with no restriction at all on when to start collection. The fair experiment would, after the critical sample size is reached for every group, calculate whether there is a significant deviation from the control for both experimental groups, as well as whether there is a significant deviation between the two experimental groups. Only if all three statistics are significant can we support the hypothesis that flipping a head influences the subsequent probability of heads or tails. Note that, strictly speaking, we cannot even conclude the complementary hypothesis: that flipping tails influences the subsequent probability of heads or tails. A replication experiment would have to be used to support this complement in order not to overtest a data set. Standard p-value statistical analyses decrease in accuracy the more hypotheses we test on a single dataset, and while two hypotheses probably wouldn't be considered too many, it is poor practice, and introduces unwanted uncertainties, to reuse a data set at all. Now to go back to the point where we want to find support for an unfair coin. Obviously the two complementary experiments are a crucial part of this investigation.
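To make the three-group comparison concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the flips are simulated, the function names (`collect`, `run_experiment`) and the 0.6/0.4 probabilities are hypothetical, and a pooled two-proportion z-test is swapped in as one common choice of significance test, since the discussion above does not name a specific one.

```python
import math
import random


def two_proportion_p_value(heads_a, n_a, heads_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (heads_a + heads_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (heads_a / n_a - heads_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def collect(rng, n, start, p_after_heads, p_after_tails):
    """Record n flips, keeping only flips that directly follow `start`.

    start is "H", "T", or None (control group: no restriction).
    Returns (number of heads recorded, number of flips recorded).
    """
    heads = 0
    recorded = 0
    previous = rng.choice("HT")
    while recorded < n:
        # Hypothetical coin whose bias may depend on the previous flip.
        p_heads = p_after_heads if previous == "H" else p_after_tails
        flip = "H" if rng.random() < p_heads else "T"
        if start is None or previous == start:
            recorded += 1
            heads += flip == "H"
        previous = flip
    return heads, recorded


def run_experiment(seed, n, p_after_heads, p_after_tails, alpha=0.05):
    """Run all three groups and require all three comparisons to be significant."""
    rng = random.Random(seed)
    after_heads = collect(rng, n, "H", p_after_heads, p_after_tails)
    after_tails = collect(rng, n, "T", p_after_heads, p_after_tails)
    control = collect(rng, n, None, p_after_heads, p_after_tails)
    p_values = {
        "heads vs control": two_proportion_p_value(*after_heads, *control),
        "tails vs control": two_proportion_p_value(*after_tails, *control),
        "heads vs tails": two_proportion_p_value(*after_heads, *after_tails),
    }
    supported = all(p < alpha for p in p_values.values())
    return p_values, supported


if __name__ == "__main__":
    # Assumed, deliberately unfair coin: heads is likelier right after heads.
    p_values, supported = run_experiment(
        seed=1, n=20000, p_after_heads=0.6, p_after_tails=0.4
    )
    for name, p in p_values.items():
        print(name, p)
    print("hypothesis supported:", supported)
```

Each group is collected separately rather than carved out of one shared data set, which matches the point above about not reusing a data set for several hypotheses.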
Now if we were to define unfairness as the results being weighted in any way, then the two experiments may be considered sufficient, since we have shown a high correlation of starting with heads to one result and starting with tails to another result, and found statistical significance between these two results and a control result. However, if we are to narrow this definition and attempt to make the assumption that flipping a head directly leads to more heads, it would require a considerable number of follow-up tests. This is because now we are no longer testing for the influence of an undefined variable. Instead, we are testing for the influence of a specific variable. In order to show evidence for flipping heads leading to more heads, we must rule out the influence of any other variable that might lead to more heads, starting with any variables that might be related to starting with a head (such as, for example, the way you flip the coin). If this sounds unbelievably tedious, that is because it is. This is why attempts to answer causation questions are often dismissed (really, obtaining the source algorithm programmatically, even if this constitutes illegal hacking, would be more promising than a scientific approach to proving a source algorithm). If it sounds like this system is weighted against assumptions of hidden variables influencing our RNG, that is also true, because the burden of proof lies on any assumption that deviates from common practice or the established norm. Luckily, in most logical cases, it doesn't take infinite exhaustive trials to make a point. For example, testing whether the position of Mars could also have been an influence on getting more heads would be considered by most to be preposterous.
Most likely, follow-up experiments would only be run for proven or highly likely methods of intentionally influencing the probability of heads, with the defending hypotheses being either that when those methods are employed unintentionally, they do not influence the probability, or that a controlled environment where those methods are explicitly forbidden gives results comparable to the parent experiment. In short, to begin the journey of checking if a coin is fair, taking just trials that start with flipping heads makes for an incomplete experiment, even granting that many more experiments are required anyway. Not establishing proper controls to base the statistical deviation on undermines the validity of the claims in the experiment. Going back to the point where we're trying to figure out RNG in Aigis, in order to be truly internally valid, the experiment must include a statistically significant sample size of users committed to testing ONLY one strategy and submitting ALL of their data for each strategy that you want to test. Taking only the results of a certain strategy from different people who are employing different strategies introduces far too many variables associated with how each user employed their strategies. Taking random samples that conform to a strategy regardless of who they came from is also a valid approach, but would require a larger sample size to rule out individual differences as confounding factors (this is the internet survey approach to an experiment).
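For a rough sense of what "critical sample size" could mean in practice, here is a hedged sketch using the common normal-approximation formula for comparing two proportions. The function name `sample_size_per_group`, the example rates, and the defaults of a 0.05 significance level and 0.8 power are all assumptions for illustration, not values taken from this discussion.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_test, alpha=0.05, power=0.8):
    """Approximate trials needed per group to detect p_control vs p_test.

    Uses the normal approximation for a two-sided two-proportion test.
    """
    if p_control == p_test:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_test) / 2
    # Spread of the difference if both groups really share the same rate.
    null_term = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    # Spread of the difference if the rates differ as assumed.
    alt_term = z_power * math.sqrt(
        p_control * (1 - p_control) + p_test * (1 - p_test)
    )
    return math.ceil((null_term + alt_term) ** 2 / (p_control - p_test) ** 2)


if __name__ == "__main__":
    # Hypothetical rates: control at 0.50, suspected effect raising it to 0.60.
    print(sample_size_per_group(0.50, 0.60))
    # A smaller suspected effect needs noticeably more trials per group.
    print(sample_size_per_group(0.50, 0.55))
```

The required count grows quickly as the suspected effect shrinks, which is one reason pooled samples from many users with individual differences need to be larger.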